fix(store): scope FSM_SYNC_MODE=nosync to raft-apply callers only #600
Previously, ApplyMutations/DeletePrefixAt used s.fsmApplyWriteOpts unconditionally, which made ELASTICKV_FSM_SYNC_MODE=nosync affect direct (non-raft) callers that do NOT have raft-log replay as a durability backstop. Concrete production path: distribution.EnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutations. If the process crashed before Pebble flushed, those acknowledged writes could be lost with no raft entry to re-apply.

This change splits the API (sketched below):

- ApplyMutations / DeletePrefixAt: always pebble.Sync. Safe for any caller, including those that bypass raft (catalog bootstrap, admin snapshots, migrations, tests).
- ApplyMutationsRaft / DeletePrefixAtRaft: governed by s.fsmApplyWriteOpts (ELASTICKV_FSM_SYNC_MODE). Intended solely for the FSM apply loop; only the raft WAL fsync makes pebble.NoSync safe.

kv/fsm.go is the only caller of the new *Raft methods. CatalogStore and every adapter test retain the always-sync ApplyMutations, so the nosync opt-in can no longer silently weaken durability outside the raft-apply path. MVCCStore / ShardStore / LeaderRoutedStore all implement both variants; the in-memory mvccStore has no WAL, so both delegate to the same body.

Tests:

- Added TestDirectApplyWriteOpts_AlwaysSync asserting the direct path always resolves to pebble.Sync even when the FSM-apply mode is nosync.
- Added TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably for functional coverage of the public entry points under a NoSync-configured store.
- Existing sync-mode tests + benchmark renamed to *Raft to reflect that the knob now only governs the raft-apply path.

Addresses codex P2 on #592.
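A minimal sketch of the split, assuming a Pebble-backed store. The `Mutation` type, `applyBatch` helper, and struct layout are illustrative stand-ins, not the repo's actual definitions; only the `ApplyMutations` / `ApplyMutationsRaft` / `fsmApplyWriteOpts` names come from the PR.

```go
// Hypothetical sketch: Mutation, applyBatch, and the Store layout are
// illustrative; only the method names and fsmApplyWriteOpts mirror the PR.
package kv

import "github.com/cockroachdb/pebble"

type Mutation struct {
	Key, Value []byte
}

type Store struct {
	db *pebble.DB
	// pebble.NoSync when ELASTICKV_FSM_SYNC_MODE=nosync, pebble.Sync otherwise.
	fsmApplyWriteOpts *pebble.WriteOptions
}

// ApplyMutations is the direct (non-raft) entry point: always fsync,
// because these callers have no raft log to replay after a crash.
func (s *Store) ApplyMutations(muts []Mutation) error {
	return s.applyBatch(muts, pebble.Sync)
}

// ApplyMutationsRaft is intended solely for the FSM apply loop (kv/fsm.go).
// pebble.NoSync is tolerable here only because the raft WAL already fsynced
// the entry that produced these mutations.
func (s *Store) ApplyMutationsRaft(muts []Mutation) error {
	return s.applyBatch(muts, s.fsmApplyWriteOpts)
}

func (s *Store) applyBatch(muts []Mutation, opts *pebble.WriteOptions) error {
	b := s.db.NewBatch()
	defer b.Close()
	for _, m := range muts {
		if err := b.Set(m.Key, m.Value, nil); err != nil {
			return err
		}
	}
	return b.Commit(opts)
}
```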
Codex Review: Didn't find any major issues. Can't wait for the next one!
…ni medium)

Address two PR #636 review concerns:

Codex P2 — `4 × 1 MiB = 4 MiB` exactly is *over* `MaxSizePerMsg` once protobuf framing overhead is counted. Each pb.Mutation carries the Op tag, Key tag + bytes, and Value length prefix; the pb.Request envelope wraps them; marshalRaftCommand prepends one byte. With a normal-length objectKey the encoded entry is ~4 MiB + 300 B; with a 1 KiB objectKey it grows to ~4 MiB + 4 KiB. Either way the entry falls into etcd/raft's util.go:limitSize oversized-first-entry path (the documented "as an exception" branch), bypassing the `MaxInflight × MaxSizePerMsg` per-peer memory bound the prior commit was trying to enforce.

Fix: lower s3ChunkBatchOps from 4 to 3. `3 × 1 MiB = 3 MiB` plus a few hundred bytes leaves ~1 MiB of headroom even for kilobyte-scale keys, so the entry stays strictly under MaxSizePerMsg and the batched-MsgApp path applies. Per-PUT Raft commit count grows ~5× from the pre-PR-#636 baseline (5 GiB PUT: 320 → ~1707 entries) but each fsync is ~5× smaller; the WAL group commit landed in PR #600 absorbs the rate.

Gemini medium — the previous commit also forced the cleanup paths (cleanupPartBlobsAsync, deleteByPrefix, cleanupManifestBlobs, appendPartBlobKeys) onto the same 4-op batch even though they carry no chunk payload. Those ops are pure-key Dels; a 4-op batch amplifies cleanup latency for large objects (5 GiB cleanup → ~1707 batches at the data-write size).

Fix: introduce s3MetaBatchOps = 64. With BlobKey-shaped keys ~100 B each, 64 keys × 100 B ≈ 6 KiB per batch, still three orders of magnitude under MaxSizePerMsg. A 5 GiB-object cleanup commits ~80 batches instead of ~1707, so orphaned-blob garbage collection finishes proportionally faster and does not amplify Raft load relative to the data-write path. Updates the four cleanup callsites (adapter/s3.go:1879, 1900, 1933, 2206, 2227) plus the appendPartBlobKeys loop helper.

Tests:

- TestS3ChunkBatchFitsInRaftMaxSize builds a worst-case PutObject batch (1 KiB objectKey, fully-populated 1 MiB chunk values), marshals via the same protobuf path the real Dispatch uses, and asserts the encoded entry plus the marshalRaftCommand 1-byte framing prefix is strictly under MaxSizePerMsg = 4 MiB. A second assertion pins the headroom margin at >= 64 KiB so a future bump in s3ChunkBatchOps or s3ChunkSize is caught at PR time.
- TestS3MetaBatchFitsInRaftMaxSize is the equivalent invariant for s3MetaBatchOps on the cleanup path.

Build / vet / lint clean. S3 unit tests pass.
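A back-of-envelope version of that invariant, assuming fixed per-op framing estimates in place of the repo's real pb.Request marshaling. The `perOpFramingEst` and `envelopeEst` constants are assumptions, not measured values; the real test marshals through the actual protobuf types.

```go
// Sketch of the batch-size invariant: the real TestS3ChunkBatchFitsInRaftMaxSize
// encodes a real pb.Request; here protobuf overhead is approximated with
// assumed constants so the arithmetic stands on its own.
package adapter

import "testing"

const (
	s3ChunkSize      = 1 << 20 // 1 MiB chunk values
	s3ChunkBatchOps  = 3       // post-fix batch size
	maxSizePerMsg    = 4 << 20 // raft MaxSizePerMsg since PR #593
	worstCaseKeyLen  = 1 << 10 // 1 KiB object key
	perOpFramingEst  = 64      // assumed protobuf tag/length overhead per mutation
	envelopeEst      = 64      // assumed pb.Request envelope + 1-byte command prefix
	requiredHeadroom = 64 << 10
)

func TestS3ChunkBatchFitsInRaftMaxSizeSketch(t *testing.T) {
	entry := envelopeEst + s3ChunkBatchOps*(s3ChunkSize+worstCaseKeyLen+perOpFramingEst)
	if entry >= maxSizePerMsg {
		t.Fatalf("entry %d B exceeds MaxSizePerMsg %d B", entry, maxSizePerMsg)
	}
	if maxSizePerMsg-entry < requiredHeadroom {
		t.Fatalf("headroom %d B below required %d B", maxSizePerMsg-entry, requiredHeadroom)
	}
}
```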
…s=4 (#636)

## Summary

Align the S3 PutObject Raft entry size with `MaxSizePerMsg` (4 MiB since PR #593).

**Change:** `s3ChunkBatchOps = 16 → 4`, so that one Raft entry = `s3ChunkBatchOps × s3ChunkSize ≈ 4 MiB`. One-line diff.

## Why

`etcd/raft`'s `util.go:limitSize` has a documented exception: even when a single entry exceeds `MaxSizePerMsg`, that entry is not rejected but is sent alone in its own MsgApp (`"if the size of the first entry exceeds maxSize, a non-empty slice with just this entry is returned"`). A 16 MiB entry sails straight through this path, so:

| Item | s3ChunkBatchOps=16 | s3ChunkBatchOps=4 (this PR) |
|---|---|---|
| 1 Raft entry size | ~16 MiB | ~4 MiB |
| Leader worst case / peer | 1024 × 16 MiB = **16 GiB** | 1024 × 4 MiB = **4 GiB** |
| 3-node cluster (leader) | 32 GiB | 8 GiB |
| WAL fsync per entry | 16 MiB | 4 MiB |
| Raft commits per 5 GiB PUT | 320 | 1280 (4×) |

The `1024 × 4 MiB = 4 GiB / peer` memory bound claimed in the PR #593 description assumes small entries and did not hold on the S3 path. With this PR, S3 rides the normal batched-MsgApp path, so the bound is accurate with S3 included.

Raft commit count increases 4×, but each fsync is 4× smaller, so the WAL group commit from PR #600 should offset it and end-to-end throughput should stay roughly the same.

## Test plan

- [x] `go build ./...` clean
- [x] `golangci-lint run ./...` 0 issues
- [x] `go test ./adapter/ -short -run 'TestS3|S3Server'` pass
- [ ] CI

## Follow-ups (design docs in separate PRs)

1. **S3 admission control** — hard-cap the bytes of concurrent S3 PUT bodies so leader-side memory stays under `4 GiB × peers` even as client parallelism grows.
2. **Raft snapshot strategy / blob bypass** — a 5 GiB blob riding a snapshot transfer when a follower falls behind is undesirable, so move blobs off the Raft path and replicate only manifests through raft.

/gemini review
@codex review
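The per-peer bound in the table falls out of two raft config knobs. A small sketch, assuming the `go.etcd.io/etcd/raft/v3` field names; the values mirror the table above, and nothing else about the repo's raft wiring is implied.

```go
// Illustrates the MaxInflight × MaxSizePerMsg bound from the table; the
// raft.Config field names are etcd/raft's, the values mirror this PR.
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3"
)

func main() {
	cfg := raft.Config{
		MaxSizePerMsg:   4 << 20, // 4 MiB per MsgApp since PR #593
		MaxInflightMsgs: 1024,
	}
	// Worst-case un-acked bytes the leader may buffer per follower, valid only
	// while every entry stays under MaxSizePerMsg (what s3ChunkBatchOps=4 buys).
	perPeer := cfg.MaxSizePerMsg * uint64(cfg.MaxInflightMsgs)
	fmt.Printf("per-peer bound: %d GiB\n", perPeer>>30)     // 4
	fmt.Printf("3-node leader:  %d GiB\n", 2*(perPeer>>30)) // 8 (two followers)
}
```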
## Summary

As a follow-up to PR #636 (aligning the Raft entry with `MaxSizePerMsg = 4 MiB` via `s3ChunkBatchOps = 4`), this submits two design docs. Docs only, no code.

### 1. `docs/design/2026_04_25_proposed_s3_admission_control.md`

**Problem**: PR #636 fixes per-entry memory accounting, but the aggregate (= concurrent PUT count × in-flight bytes per PUT) is unbounded under client parallelism.

**Proposal**: hard-cap, via a shared semaphore, the S3 PUT body bytes a node accepts before they are committed to Raft (see the sketch after this comment).

- (A) pre-charge the request at entry using `Content-Length`
- (B) sub-lease per batch (4 MiB) in `flushBatch` — return 503 SlowDown once `dispatchAdmissionTimeout` elapses
- default 256 MiB cap (~14% of GOMEMLIMIT=1800 MiB), tunable via env
- 503 with `Retry-After` → AWS SDKs retry automatically
- staged rollout M1–M4, including metrics and a Grafana panel

### 2. `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`

**Problem**: a 5 GiB PUT is written to the WAL in full, so follower catch-up and snapshot transfer scale as O(stored bytes).

**Proposal**: Raft carries only `ChunkRef{SHA256, Size}` (~100 B) plus the manifest. Chunk bodies are written directly to local Pebble under the separate key namespace `!s3|chunkblob|<SHA>`; when a follower sees a `chunkref` in its applyLoop, it fetches the chunk from a peer via an asynchronous fetch RPC.

- Raft load for a 1 GiB PUT shrinks to ~100 KiB (a 10⁴× reduction)
- snapshot size becomes O(manifest count)
- GC via reference counting plus a grace period
- opt-in with `ELASTICKV_S3_BLOB_OFFLOAD=true`; the legacy `BlobKey` path keeps running in parallel
- five-stage rollout from an M0 spike through an M4 migrator

Both docs:

- follow the file-name and header conventions in `docs/design/README.md` (Status: Proposed / Date: 2026-04-25)
- cross-reference the related PRs (#593, #600, #612, #617, #636) and existing designs (workload isolation, raft grpc streaming)
- state risks and open questions explicitly

## Test plan

- [x] markdown lint clean (no textlint; checked manually)
- [x] file names follow the `YYYY_MM_DD_proposed_<slug>.md` convention
- [x] links to existing designs are consistent with docs/design/README.md

Review focus requested: the design direction — the admission-control cap value, the content-addressing unit for blob offload, and the M0–M4 milestone split.

/gemini review
@codex review
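A hedged sketch of the admission-control shape from doc 1, using golang.org/x/sync/semaphore. The `handlePut`, `admissionCap`, and timeout wiring are illustrative; only the 256 MiB default, the Content-Length pre-charge, and the 503 SlowDown with `Retry-After` come from the proposal.

```go
// Illustrative only: names and wiring are assumptions; the 256 MiB cap, the
// Content-Length pre-charge, and the 503 SlowDown + Retry-After behaviour are
// the parts taken from the design proposal.
package main

import (
	"context"
	"net/http"
	"time"

	"golang.org/x/sync/semaphore"
)

const (
	admissionCap             = 256 << 20      // default cap from the proposal, env-tunable
	dispatchAdmissionTimeout = 5 * time.Second // assumed value
)

var putBodyBytes = semaphore.NewWeighted(admissionCap)

func handlePut(w http.ResponseWriter, r *http.Request) {
	if r.ContentLength <= 0 {
		http.Error(w, "MissingContentLength", http.StatusLengthRequired)
		return
	}

	ctx, cancel := context.WithTimeout(r.Context(), dispatchAdmissionTimeout)
	defer cancel()

	// (A) pre-charge the whole request by Content-Length before touching Raft.
	if err := putBodyBytes.Acquire(ctx, r.ContentLength); err != nil {
		// Cap still exhausted after the admission timeout: shed load so the
		// AWS SDK retries instead of piling bytes onto the leader.
		w.Header().Set("Retry-After", "1")
		http.Error(w, "SlowDown", http.StatusServiceUnavailable)
		return
	}
	defer putBodyBytes.Release(r.ContentLength)

	// (B) would go here: flushBatch sub-leases 4 MiB per Raft batch.
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handlePut)
	_ = http.ListenAndServe(":8080", nil)
}
```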
Summary
Follow-up to #592 addressing the codex P2 correctness finding:
- `ELASTICKV_FSM_SYNC_MODE=nosync` previously affected direct (non-raft) callers of `store.ApplyMutations` / `store.DeletePrefixAt`, which have no raft-log replay as a durability backstop. Splits the API so the knob only relaxes fsync on the raft-apply path.
- `ApplyMutations` / `DeletePrefixAt` are now unconditionally `pebble.Sync`. Safe for any direct caller (catalog bootstrap, admin snapshots, migrations, tests).
- `ApplyMutationsRaft` / `DeletePrefixAtRaft` are the new raft-apply entry points; they observe `s.fsmApplyWriteOpts`. Called only from `kv/fsm.go`.

Code evidence for the bug
- `distribution/catalog.go:630` calls `s.store.ApplyMutations` directly from `CatalogStore.applySaveMutations`.
- `main.go:749` wires `distribution.NewCatalogStore` around the raw `raftGroupRuntime.store` (a `store.MVCCStore`), so `EnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutations` is a production path that never goes through the raft apply loop. Under `nosync`, a crash before Pebble flush would lose acknowledged catalog writes with no raft entry to replay — exactly the failure mode codex P2 described.

Before / after per call site
| Call site | Before | After |
|---|---|---|
| `kv/fsm.go` (raft apply: raw / 1PC / prepare / commit / abort / DEL_PREFIX) | `s.fsmApplyWriteOpts` | `s.fsmApplyWriteOpts` (via `*Raft`) — unchanged |
| `distribution/catalog.go` `CatalogStore.applySaveMutations` | `s.fsmApplyWriteOpts` (UNSAFE) | `pebble.Sync` (always) |
| `adapter/*_test.go` direct `ApplyMutations`/`DeletePrefixAt` calls | `s.fsmApplyWriteOpts` | `pebble.Sync` (always) |
| `kv/shard_store.go`, `kv/leader_routed_store.go` wrappers | | |

Test plan
- `TestDirectApplyWriteOpts_AlwaysSync` asserts the direct path resolves to `pebble.Sync` even when the FSM-apply mode is `nosync`.
- `TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably` exercises the public `ApplyMutations` / `DeletePrefixAt` entry points under a NoSync-configured store and confirms data is visible after a clean reopen (sketched below).
- Existing `TestApplyMutations_NoSync*` and `BenchmarkApplyMutations_SyncMode` renamed to `*Raft` to reflect that the knob now only governs the raft-apply path.
- `go test -race -count=1 ./...` — all packages pass.
- `golangci-lint run ./...` — 0 issues.

/gemini review
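A sketch of the clean-reopen check referenced above, operating on Pebble directly over an in-memory FS rather than the repo's `Store` type; the test name and key are made up, and the in-memory FS means this checks reopen visibility rather than crash durability.

```go
// Illustrative shape of the write-then-reopen assertion; not the repo's test.
package kv

import (
	"bytes"
	"testing"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func TestDirectWriteVisibleAfterReopenSketch(t *testing.T) {
	fs := vfs.NewMem()
	open := func() *pebble.DB {
		db, err := pebble.Open("store", &pebble.Options{FS: fs})
		if err != nil {
			t.Fatal(err)
		}
		return db
	}

	db := open()
	// Direct path: always pebble.Sync, regardless of ELASTICKV_FSM_SYNC_MODE.
	if err := db.Set([]byte("catalog/k"), []byte("v"), pebble.Sync); err != nil {
		t.Fatal(err)
	}
	if err := db.Close(); err != nil {
		t.Fatal(err)
	}

	db = open()
	defer db.Close()
	got, closer, err := db.Get([]byte("catalog/k"))
	if err != nil {
		t.Fatal(err)
	}
	defer closer.Close()
	if !bytes.Equal(got, []byte("v")) {
		t.Fatalf("got %q, want %q", got, "v")
	}
}
```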
@codex review